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Abstract 


This paper aims to compare different reg¬ 
ularization strategies to address a com¬ 
mon phenomenon, severe overfitting, in 
embedding-based neural networks for 
NLP. We chose two widely studied neu¬ 
ral models and tasks as our testbed. 
We tried several frequently applied or 
newly proposed regularization strategies, 
including penalizing weights (embeddings 
excluded), penalizing embeddings, re¬ 
embedding words, and dropout. We also 
emphasized on incremental hyperparame¬ 
ter tuning, and combining different regu¬ 
larizations. The results provide a picture 
on tuning hyperparameters for neural NLP 
models. 


1 Introduction 


Neural networks have exhibited considerable po¬ 


tential in various fields (Krizhevsky et al., 2012 


Graves et al., 20131. In early years on neural 


NLP research, neural networks were used in lan¬ 
guage modeling (Bengio et al., 2003; Morin and 
Bengio, 2005 ^ Mnih and Hinton, 2009 1 ; recently, 


they have been applied to various supervised tasks, 


such as named entity recognition (Collobert and 
Weston, 2008 ), sentiment analysis (|Socher et al., 


|20111 |Mou et al., 2015), relation classification 


( Zeng et al., 2014f Xu et a l,, 2015| ), etc. In the field 
of NLP, neural networks are typically combined 
with word embeddings, which are usually first pre¬ 


trained by unsupervised algorithms like Mikolov 


et al. (20131; then they are fed forward to standard 


neural models, fine-tuned during supervised learn¬ 
ing. However, embedding-based neural networks 
usually suffer from severe overfitting because of 
the high dimensionality of parameters. 


*Equal contribution. ^ Corresponding author. 


A curious question is whether we can regular¬ 
ize embedding-based NLP neural models to im¬ 
prove generalization. Although existing and newly 
proposed regularization methods might alleviate 
the problem, their inherent performance in neural 
NLP models is not clear: the use of embeddings 
is sparse; the behaviors may be different from 
those in other scenarios like image recognition. 
Further, selecting hyperparameters to pursue the 
best performance by validation is extremely time- 
consuming, as suggested in jCollobert et al. (2011[ ). 
Therefore, new studies are needed to provide a 
more complete picture regarding regularization for 
neural natural language processing. Specifically, 
we focus on the following research questions in 
this paper. 

RQ 1: How do different regularization strategies 
typically behave in embedding-based neural 
networks? 

RQ 2: Can regularization coefficients be tuned in¬ 
crementally during training so as to ease the 
burden of hyperparameter tuning? 

RQ 3: What is the effect of combining different 
regularization strategies? 

In this paper, we systematically and quan¬ 
titatively compared four different regularization 
strategies, namely penalizing weights, penalizing 
embeddings, newly proposed word re-embedding 
(Labutov and Lipson, 2013), and dropout (Srivas 


tava et al., 2014). We analyzed these regulariza¬ 


tion methods by two widely studied models and 
tasks. We also emphasized on incremental hyper¬ 
parameter tuning and the combination of different 
regularization methods. 

Our experiments provide some interesting re¬ 
sults: (1) Regularizations do help generalization, 
but their effect depends largely on the datasets’ 
size. (2) Penalizing ( 2 -norm of embeddings helps 
optimization as well, improving training accu¬ 
racy unexpectedly. (3) Incremental hyperparam¬ 
eter tuning achieves similar performance, indicat- 





































ing that regularizations mainly serve as a “local” 
effect. (4) Dropout performs slightly worse than 
£2 penalty in our experiments; however, provided 
very small 1 2 penalty, dropping out hidden units 
and penalizing ^ 2 -norm are generally complemen¬ 
tary. (5) The newly proposed re-embedding words 
method is not effective in our experiments. 


2 Tasks, Models, and Setup 

Experiment I: Relation extraction. The dataset 
in this experiment comes from SemEval-2010 
Task 80 The goal is to classify the relationship 
between two marked entities in each sentence. We 
refer interested readers to recent advances, e.g., 


Hashimoto et al. (2013), Zeng et al. (2014), and 


|Xu et al. (2015| ). To make our task and model 
general, however, we do not consider entity tag¬ 
ging information; we do not distinguish the order 
of two entities either. In total, there are 10 labels, 
i.e., 9 different relations plus a default other. 

Regarding the neural model, we applied Col- 
lobert’s convolutional neural network (CNN) 
( jCollobert and Weston, 2008] ) with minor modi¬ 
fications. The model comprises a fixed-window 
convolutional layer with size equal to 5, 0 padded 
at the end of each sentence; a max pooling layer; 
a tanh hidden layer; and a softmax output layer. 

Experiment II: Sentiment analysis. This is 
another testbed for neural NLP, aiming to pre¬ 
dict the sentiment of a sentence. The dataset is 
the Stanford sentiment treebank ( |Socher et al., 
201 if 1 2 target labels are strongly/weakly 
positive/negative, or neutral. 

We used the recursive neural network (RNN), 
which is proposed in|Socher et al. (20TT[), and fur¬ 


ther developed in Socher et al. (2012); Irsoy and 


Cardie (2014). RNNs make use of binarized con¬ 


stituency trees, and recursively encode children’s 
information to their parent’s; the root vector is fi¬ 
nally used for sentiment classification. 

Experimental Setup. To setup a fan - compari¬ 
son, we set all layers to be 50-dimensional in ad¬ 
vance (rather than by validation). Such setting has 
been used in previous work like Zhao et al. (2015). 
Our embeddings are pretrained on the Wikipedia 
corpus using Collobert and Weston (2008). The 
learning rate is 0.1 and fixed in Experiment I; 
for RNN, however, we found learning rate decay 
helps to prevent parameter blowup (probably due 


1 http://www.aclweb.org/anthology/S10-1006 

2 http://nlp.stanford.edu/sentiment/ 


to the recursive, and thus chaotic nature). There¬ 


fore, we applied power decay (Senior et al., 20131 
with power equal to —1. For each strategy, we 
tried a large range of regularization coefficients, 
10~ 9 , • • • , HP 2 , extensively from underfitting to 
no effect with granularity lOx. We ran the model 
5 times with different initializations. We used 
mini-batch stochastic gradient descent; gradients 
are computed by standard backpropagation. For 
source code, please refer to our project website 0 

It needs to be noticed that, the goal of this paper 
is not to outperform or reproduce state-of-the-art 
results. Instead, we would like to have a fair com¬ 
parison. The testbed of our work is two widely 
studied models and tasks, which were not chosen 
on puipose. During the experiments, we tried to 
make the comparison as fair as possible. There¬ 
fore, we think that the results of this work can be 
generalized to similar scenarios. 

3 Regularization Strategies 

In this section, we describe four regularization 
strategies used in our experiment. 

• Penalizing l^-norm of weights. Let E be the 
cross-entropy error for classification, and R 
be a regularization term. The overall cost 
function is J = E + A R, where A is the co¬ 
efficient. In this case, R = ||VL|| 2 , and the 
coefficient is denoted as Apy. 

• Penalizing f^-norm of embeddings. Some 
studies do not distinguish embeddings or 


connectional weights for regularization (Tai 


|et al., 2015] ). However, we would like to an¬ 
alyze their effect separately, for embeddings 
are sparse in use. Let <I> denote embeddings; 
then we have R = ||4>|| 2 . 


Re-embedding words (Labutov and Lipson, 


2013). Suppose 4>o denotes the original em¬ 


beddings trained on a large corpus, and 4? de¬ 
notes the embeddings fine-tuned during su¬ 
pervised training. We would like to penalize 
the norm of the difference between 4>o and 4>, 
i.e., R = ||4>o — <I> 11 2 . In the limit of penalty to 
infinity, the model is mathematically equiv¬ 
alent to “frozen embeddings,” where word 
vectors are used as surface features. 

Dropout (Srivastava et al., 2014). In this 
strategy, each neural node is set to 0 with a 
predefined dropout probability p during train¬ 
ing; when testing, all nodes are used, with ac¬ 
tivation multiplied by 1 — p. 


1 https://sites.google.com/site/regembeddingnn/ 
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(c) Penalizing embeddings in Experiment I. 




(g) Applying dropout in Experiment I. p = 0.5,0.6 
are omitted because they are similar to small values. 






Figure 1 : Averaged learning curves. Left: Experiment I, relation extraction with CNN. Right: Experiment II, sentiment 
analysis with RNN. From top to bottom, we penalize weights, penalize embeddings, re-embed words, and drop out. Dashed 
lines refer to training accuracies; solid lines are validation accuracies. 





































































4 Individual Regularization Behaviors 


5 Incremental Hyperparameter Tuning 


This section compares the behavior of each strat¬ 
egy. We first conducted both experiments with¬ 
out regularization, achieving accuracies of 54.02± 
0.84%, 41.47 ± 2.85%, respectively. Then we plot 
in Figure [T] learning curves when each regulariza¬ 
tion strategy is applied individually. We report 
training and validation accuracies through out this 
paper. The main findings are as follows. 

• Penalizing f^-norm of weights helps gener¬ 
alization; the effect depends largely on the 
size of training set. Experiment I contains 
7,000 training samples and the improvement 
is 6.98%; Experiment II contains more than 
150k samples, and the improvement is only 
2.07%. Such results are consistent with other 
machine learning models. 


• Penalizing t^-norm of embeddings unexpect¬ 
edly helps optimization (improves training 
accuracy). One plausible explanation is that 
since embeddings are trained on a large cor¬ 
pus by unsupervised methods, they tend to 
settle down to large values and may not per¬ 
fectly agree with the tasks of interest. £2 
penalty pushes the embeddings towards small 
values and thus helps optimization. Regard¬ 
ing validation accuracy, Experiment I is im¬ 
proved by 6.89%, whereas Experiment II has 
no significant difference. 


Re-embedding words does not improve gen¬ 
eralization. Particularly, in Experiment II, 
the ultimate accuracy is improved by 0.44, 
which is not large. Further, too much penalty 
hurts the models in both experiments. In the 
limit A r eembed to infinity, re-embedding words 
is mathematically equivalent to using embed¬ 
dings as surface features, that is, freezing em¬ 
beddings. Such strategy is sometimes applied 
in the literature like |Hu et al. (2014 1 , but is 
not favorable as suggested by the experiment. 


Dropout helps generalization. Under the best 
settings, the eventual accuracy is improved 
by 3.12% and 1.76%, respectively. In our ex¬ 
periments, dropout alone is not as useful as 
£2 penalty. However, other studies report that 


dropout is very effective (Irsoy and Cardie, 


2014). Our results are not consistent; differ¬ 


ent dimensionality may contribute to this dis¬ 
agreement, but more experiments are needed 
to confirm the hypothesis. 


The above experiments show that regularization 
generally helps prevent overfitting. To pursue the 
best performance, we need to try out different hy¬ 
perparameters through validation. Unfortunately, 
training deep neural networks is time-consuming, 
preventing full grid search from being a practical 
technique. Things will get easier if we can incre¬ 
mentally tune hyperparameters, that is, to train the 
model without regularization first, and then add 
penalty. 

In this section, we study whether lo penalty of 
weights and embeddings can be tuned incremen¬ 
tally. We exclude the dropout strategy because its 
does not make much sense to incrementally drop 
out hidden units. Besides, from this section, we 
only focus on Experiment I due to time and space 
limit. 

Before continuing, we may envision several 
possibilities on how regularization works. 

• (On initial effects) As I 2 -norm prevents pa¬ 
rameters from growing large, adding it at 
early stages may cause parameters settling 
down to local optima. If this is the case, de¬ 
layed penalty would help parameters get over 
local optima, leading to better performance. 

• (On eventual effects) i 2 penalty lifts er¬ 
ror surface of large weights. Adding such 
penalty may cause parameters settling down 
to (a) almost the same catchment basin, or (b) 
different basins. In case (a), when the penalty 
is added does not matter much. In case (b), 
however, it makes difference, because param¬ 
eters would have already gravitated to catch¬ 
ment basins of larger values before regular¬ 
ization is added, which means incremental 
hyperparameter tuning would be ineffective. 

To verify the above conjectures, we design four 
settings: adding penalty (1) at the beginning, (2) 
before overfitting at epoch 2, (3) at peak perfor¬ 
mance (epoch 5), and (4) after overfitting (valida¬ 
tion accuracy drops) at epoch 10. 

Figure [2]plots the learning curves regarding pe¬ 
nalizing weights and embeddings, respectively; 
baseline (without regularization) is also included. 

For both weights and embeddings, all settings 
yield similar ultimate validation accuracies. This 
shows I 2 regularization mainly serves as a “local” 
effect—it changes the error surface, but parame¬ 
ters tend to settle down to a same catchment basin. 
We notice a recent report also shows local optima 











(a) Incrementally penalizing l<i -norm of weights. (b) Incrementally penalizing € 2 -norm of biases. 


Figure 2: Tuning hyperparameters incrementally in Experiment I. Penalty is added at epochs 0, 2, 5, 10, 
respectively. We chose the coefficients yielding the best performance in Figure [T] The controlled trial 
(no regularization) is early stopped because the accuracy has already decreased. 
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Table 1: Accuracy in percentage when we com¬ 
bine ^ 2 -norm of weights and embeddings (Exper¬ 
iment I). Bold numbers are among highest accu¬ 
racies (greater than peak performance minus 1.5 
times standard deviation, i.e., 1.26 in percentage). 
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Table 2: Combining 1 2 regularization and dropout. 
Left: connectional weights. Right: embeddings. 
(p refers to the dropout rate.) 


may not play an important role in training neural 
networks, if the effect of parameter symmetry is 
ruled out ( |Breuel, 2015! ). 

We also observe that regularization helps gener¬ 
alization as soon as it is added (Figure [2ji), and that 
regularizing embeddings helps optimization also 
right after the penalty is applied (Figure [2]}). 


6 Combination of Regularizations 

We are further curious about the behaviors when 
different regularization methods are combined. 

Table [I] shows that combining ^ 2 -norm of 
weights and embeddings results in a further accu¬ 
racy improvement of 3^4 percents from applying 


either single one of them. In a certain range of 
coefficients, weights and embeddings are comple¬ 
mentary: given one hyperparameter, we can tune 
the other to achieve a result among highest ones. 

Such compensation is also observed in penal¬ 
izing ( 2 -nomi versus dropout (Table [2])— although 
the peak performance is obtained by pure 1 2 regu¬ 
larization, applying dropout with small 1 2 penalty 
also achieves a similar accuracy. The dropout rate 
is not very sensitive, provided it is small. 

7 Discussion 

In this paper, we systematically compared four 
regularization strategies for embedding-based 
neural networks in NLR Based on the experimen¬ 
tal results, we answer our research questions as 
follows. (1) Regularization methods (except re¬ 
embedding words) basically help generalization. 
Penalizing £ 2 -norm of embeddings unexpectedly 
helps optimization as well. Regularization perfor¬ 
mance depends largely on the dataset’s size. (2) 
I 2 penalty mainly acts as a local effect; hyperpa¬ 
rameters can be tuned incrementally. (3) Combin¬ 
ing ^ 2 -norm of weights and biases (dropout and i<i 
penalty) further improves generalization; their co¬ 
efficients are mostly complementary within a cer¬ 
tain range. These empirical results of regulariza¬ 
tion strategics shed some light on tuning neural 
models for NLR 
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